251 research outputs found

    Using phonetic constraints in acoustic-to-articulatory inversion

    The goal of this work is to recover articulatory information from the speech signal by acoustic-to-articulatory inversion. One of the main difficulties with inversion is that the problem is underdetermined, and inversion methods generally offer no guarantee on the phonetic realism of the inverse solutions. A way to address this issue is to use additional phonetic constraints. Knowledge of the phonetic characteristics of French vowels enables the derivation of reasonable articulatory domains in the space of Maeda parameters: given the formant frequencies (F1, F2, F3) of a speech sample, and thus the vowel identity, an "ideal" articulatory domain can be derived. The space of formant frequencies is partitioned into vowels, using either speaker-specific data or generic information on formants. Each articulatory vector can then be assigned a phonetic score that varies with its distance to the "ideal" domain associated with the corresponding vowel. Inversion experiments were conducted on isolated vowels and vowel-to-vowel transitions. Articulatory parameters were compared with those obtained without these constraints and with those measured from X-ray data.
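    A minimal sketch of the phonetic-score idea described above, assuming a box-shaped "ideal" domain and a Gaussian falloff with distance (neither shape is specified in the abstract; parameter values are illustrative):

```python
import numpy as np

def distance_to_domain(alpha, lo, hi):
    """Euclidean distance from articulatory vector `alpha` to an
    axis-aligned 'ideal' domain with per-parameter bounds [lo, hi]."""
    # componentwise distance to each interval; zero inside the box
    d = np.maximum(np.maximum(lo - alpha, alpha - hi), 0.0)
    return float(np.linalg.norm(d))

def phonetic_score(alpha, lo, hi, sigma=0.5):
    """Score in (0, 1]: 1 inside the ideal domain, decaying outside."""
    return float(np.exp(-(distance_to_domain(alpha, lo, hi) / sigma) ** 2))

# Toy 3-parameter example (the real Maeda space has 7 parameters)
lo = np.array([-1.0, -0.5, 0.0])
hi = np.array([ 1.0,  0.5, 1.0])
inside  = np.array([0.0, 0.0, 0.5])
outside = np.array([2.0, 0.0, 0.5])
assert phonetic_score(inside, lo, hi) == 1.0
assert phonetic_score(outside, lo, hi) < 1.0
```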

    A concurrent curve strategy for formant tracking

    Although automatic formant tracking has a wide range of potential applications, it is still an open problem. We previously proposed the use of active curves that deform under the influence of the spectrogram energy. Each formant was tracked independently, and a complex strategy was required to guarantee the overall consistency of the formant tracks. This paper describes how the interdependency between formants can be incorporated directly during the deformation of formant tracks: iterative processes attached to each formant are interlaced. We experimented with two strategies. The first consists of partitioning the spectrogram into exclusive regions, each affiliated with a given formant. The second consists of adding a repulsion force between formants that prevents formant tracks from merging. It turns out that the second strategy is more robust and does not require a complex control strategy.
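    The repulsion strategy can be sketched as follows; the spectrogram attraction is abstracted into a per-formant target frequency, and the force constants are illustrative assumptions:

```python
import numpy as np

def update_tracks(tracks, peak_pull, k_rep=2000.0, step=0.1):
    """One interlaced iteration: each formant frequency is pulled toward
    the spectrogram energy (abstracted here as a target in `peak_pull`)
    and pushed away from the other formants by a 1/distance repulsion."""
    tracks = np.asarray(tracks, float)
    new = tracks.copy()
    for i in range(len(tracks)):
        force = peak_pull[i] - tracks[i]       # attraction toward peak
        for j in range(len(tracks)):
            if j != i:
                d = tracks[i] - tracks[j]
                force += k_rep / d             # repulsion, signed by d
        new[i] = tracks[i] + step * force
    return new

tracks = np.array([1000.0, 1050.0])
peaks = np.array([1020.0, 1030.0])   # attraction alone would merge the tracks
moved = update_tracks(tracks, peaks)
assert moved[1] - moved[0] > tracks[1] - tracks[0]   # repulsion keeps them apart
```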

    The WinSnoori user's manual version 1.32

    Using tools for investigating speech signals is an invaluable help for teaching phonetics and, more generally, speech sciences. For several years we have been developing the software WinSnoori, intended both for speech scientists as a research tool and for phonetics teachers as an illustration tool. It consists of five types of tools: (i) editing speech signals; (ii) annotating speech signals phonetically or orthographically, with facilities to explore annotated corpora automatically; (iii) analysing speech with several spectral analyses and monitoring spectral peaks over time; (iv) studying prosody, where besides pitch calculation it is possible to synthesise new signals by modifying the F0 curve and/or the speech rate; and (v) generating parameters for the Klatt synthesiser. A user-friendly graphic interface and copy-synthesis tools allow the user to generate files for the Klatt synthesiser easily. In the context of speech sciences, Snorri can therefore be exploited for many purposes, among them illustrating speech phenomena and investigating acoustic cues of speech sounds and prosody.

    Snorri, a software for speech sciences

    Using tools for investigating speech signals is an invaluable help for teaching phonetics and, more generally, speech sciences. For several years we have been developing the software WinSnoori, intended both for speech scientists as a research tool and for phonetics teachers as an illustration tool. It consists of five types of tools: (i) editing speech signals; (ii) annotating speech signals phonetically or orthographically, with facilities to explore annotated corpora automatically; (iii) analysing speech with several spectral analyses and monitoring spectral peaks over time; (iv) studying prosody, where besides pitch calculation it is possible to synthesise new signals by modifying the F0 curve and/or the speech rate; and (v) generating parameters for the Klatt synthesiser. A user-friendly graphic interface and copy-synthesis tools allow the user to generate files for the Klatt synthesiser easily. In the context of speech sciences, Snorri can therefore be exploited for many purposes, among them illustrating speech phenomena and investigating acoustic cues of speech sounds and prosody.

    A glottal chink model for the synthesis of voiced fricatives

    This paper presents a simulation framework that enables a glottal chink model to be integrated into a time-domain continuous speech synthesizer along with self-oscillating vocal folds. The glottis is then made up of two main separate components: a self-oscillating part and a constantly open chink. This feature allows the simulation of voiced fricatives: a self-oscillating model of the vocal folds generates the voiced source, while the glottal opening provides the airflow necessary to generate the frication noise. Numerical simulations show that the model accurately simulates voiced fricatives, as well as phonetic assimilation such as sonorization and devoicing. The simulation framework is also used to show that the phonatory/articulatory space for generating voiced fricatives differs according to the desired sound: for instance, the minimal glottal opening for generating frication noise is shorter for /z/ than for /Z/.
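    A minimal sketch of the two-component glottal area described above. A half-rectified sinusoid stands in for the self-oscillating vocal-fold model (the real framework uses a physical self-oscillating model), and the area values are illustrative:

```python
import math

def glottal_area(t, f0=100.0, a_osc=0.2, a_chink=0.05):
    """Total glottal area: a self-oscillating part (here a half-rectified
    sinusoid standing in for a vocal-fold model) plus a constantly open
    chink that never closes."""
    osc = max(0.0, a_osc * math.sin(2 * math.pi * f0 * t))
    return osc + a_chink

# The total area never reaches zero: the chink stays open through the
# closed phase, which is what permits frication noise in voiced fricatives.
areas = [glottal_area(n / 16000.0) for n in range(160)]
assert min(areas) >= 0.05
```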

    Méthodologie 3-way d'extraction d'un modèle articulatoire de la parole à partir des données d'un locuteur

    For speaking, a speaker sets in motion a complex set of articulators: the jaw, which opens more or less; the tongue, which takes many shapes and positions; the lips, which let the air escape more or less abruptly; etc. The best-known articulatory model is that of Maeda (1990), derived from Principal Component Analyses performed on arrays of coordinates of points on the articulators of a speaker while talking. We propose a 3-way analysis of the same type of data, after converting the coordinate tables into distance tables. We validate our model by predicting the spoken sounds; the prediction proved almost as good as that of the acoustic model, and even better when co-articulation is taken into account.
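    The conversion from coordinate tables to distance tables ahead of the 3-way analysis might look like this (the frames x points x coordinates layout is an assumption for illustration):

```python
import numpy as np

def to_distance_tables(frames):
    """Convert a sequence of articulator point tables
    (frames x points x 2 coordinates) into a 3-way array of pairwise
    point distances (frames x points x points), the input form for a
    3-way analysis."""
    frames = np.asarray(frames, float)
    # broadcast to all point pairs within each frame, then take norms
    diff = frames[:, :, None, :] - frames[:, None, :, :]
    return np.linalg.norm(diff, axis=-1)

# two frames, three tracked articulator points each
frames = [[[0, 0], [1, 0], [0, 1]],
          [[0, 0], [2, 0], [0, 2]]]
D = to_distance_tables(frames)
assert D.shape == (2, 3, 3)
assert D[0, 0, 1] == 1.0 and D[1, 0, 1] == 2.0
```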

    Autoencoder-Based Tongue Shape Estimation During Continuous Speech

    Vocal tract shape estimation is a necessary step for articulatory speech synthesis. However, the literature on the topic is scarce, and most current methods fail to respect many physical constraints related to speech production. This study proposes an alternative approach to the task that solves specific issues faced in previous work, especially those related to critical articulators. We present an autoencoder-based method for tongue shape estimation during continuous speech. An autoencoder is trained to learn an encoding of the data and serves as an auxiliary network for the principal one, which maps phonemes to tongue shapes. Instead of predicting the exact points on the target curve, the neural network learns to predict the curve's main components, i.e., the autoencoder's representation. We show how this approach allows imposing constraints on critical articulators, controlling the tongue shape through the latent space, and generating a smooth output without relying on any postprocessing method.
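    A schematic of the two-network arrangement, with untrained linear maps standing in for the trained autoencoder decoder and the phoneme-to-latent predictor (the sizes and the phoneme-feature input are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for trained networks: plain linear maps for illustration.
LATENT, POINTS = 4, 50
decoder = rng.normal(size=(POINTS, LATENT))   # latent code -> tongue contour
predictor = rng.normal(size=(LATENT, 8))      # phoneme features -> latent code

def estimate_contour(phoneme_features):
    """Predict the latent code first, then decode it into a tongue
    contour: the principal network never outputs raw points directly,
    so the output always lies in the autoencoder's learned subspace."""
    z = predictor @ phoneme_features          # predicted latent representation
    return decoder @ z                        # decoded contour points

contour = estimate_contour(rng.normal(size=8))
assert contour.shape == (POINTS,)
```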

    Adaptation of cepstral coefficients for acoustic-to-articulatory inversion

    Acoustic-to-articulatory inversion of speech signals via an analysis-by-synthesis method requires the comparison of natural and synthetic speech spectra, either indirectly via formant frequencies or directly via cepstral coefficients. This paper investigates several strategies of cepstral adaptation (affine transformation of cepstral coefficients, bilinear or piecewise linear frequency warping) when X-ray images of the speaker's vocal tract are available. These images enable the articulatory synthesis of a speech signal that best fits the natural signal. It is thus possible to investigate the behavior of several cepstral adaptation procedures in order to select the best method, i.e. the one that minimizes the deviation between synthetic and natural spectra. Our results show that affine cepstral adaptation tends to flatten the spectral peaks, i.e. the formants. Frequency warping techniques are thus more efficient, all the more so as they can be supplemented by taking the spectral tilt into account.
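    The two families of adaptation can be sketched as follows: the affine transform applies per-coefficient gains and offsets, and the bilinear map shows how a single parameter alpha bends the frequency axis (coefficient values are illustrative):

```python
import math

def affine_adapt(cepstrum, a, b):
    """Affine cepstral adaptation: c'_k = a_k * c_k + b_k."""
    return [ak * ck + bk for ak, ck, bk in zip(a, cepstrum, b)]

def bilinear_warp(omega, alpha):
    """Bilinear warping of a normalized frequency (radians, 0..pi):
    alpha > 0 pushes frequencies upward, alpha < 0 downward, while the
    endpoints 0 and pi stay fixed."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))

assert affine_adapt([1.0, 2.0], a=[2.0, 0.5], b=[0.0, 1.0]) == [2.0, 2.0]
w = math.pi / 4
assert bilinear_warp(w, 0.2) > w       # upward warp
assert bilinear_warp(w, -0.2) < w      # downward warp
```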

    Speech signal resampling by arbitrary rate

    In this paper we discuss issues related to resampling a speech signal at an arbitrary frequency using interpolation methods. The implementation of four resampling methods is presented: 1. direct interpolation, 2. Lagrange interpolation, 3. sinc interpolation, and 4. Taylor series. These methods were tested with speech data and various resampling frequencies. The quality of the resampled speech signal was analyzed and evaluated by human listening. The experimental results show that, for both upsampling and downsampling, the sinc and Lagrange methods generate additive high-frequency noise resembling metallic sounds, whereas the direct and Taylor methods do not present such problems. Speech resampled by the direct and Taylor methods sounds more natural than that resampled by the sinc and Lagrange methods.
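    Of the four methods, Lagrange interpolation is easy to sketch. This pure-Python version (an illustrative sketch, not the paper's implementation) fits a local cubic around each fractional output index:

```python
def lagrange_resample(x, ratio, order=3):
    """Resample signal x by an arbitrary rate ratio using local
    Lagrange interpolation over `order` + 1 neighbouring samples."""
    n_out = int(len(x) * ratio)
    y = []
    for m in range(n_out):
        t = m / ratio                               # fractional source index
        i0 = min(max(int(t) - order // 2, 0), len(x) - order - 1)
        s = 0.0
        for j in range(i0, i0 + order + 1):
            L = 1.0                                  # Lagrange basis polynomial
            for k in range(i0, i0 + order + 1):
                if k != j:
                    L *= (t - k) / (j - k)
            s += x[j] * L
        y.append(s)
    return y

# a straight line is reproduced exactly by polynomial interpolation
line = [0.5 * n for n in range(20)]
up = lagrange_resample(line, 1.5)
assert all(abs(up[m] - 0.5 * m / 1.5) < 1e-9 for m in range(len(up)))
```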

    Amélioration de la précision de la resynthèse avec TD-PSOLA

    This paper describes a method for improving the prosodic modifications performed with the PSOLA method. PSOLA relies on decomposing the speech signal into overlapping windows synchronized with the fundamental periods. The main objective is to preserve coherence between neighbouring pitch marks while taking the temporal structure of the fundamental periods into account. We first present improvements to the pitch-marking algorithm, which consist in removing the errors that appear during strongly marked formant transitions; the idea is to prune marks that cannot be reconciled with the known fundamental frequency. The synthesis improvements consist in increasing the temporal precision by dynamically resampling the speech signal so that it can be placed exactly where needed during synthesis. Together, these two improvements strongly reduce the noise between harmonics, yielding speech signals of very high quality.
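    The pitch-mark pruning idea (discard marks that contradict the known fundamental frequency) can be sketched as follows; the tolerance and sample values are illustrative assumptions:

```python
def prune_marks(marks, f0, fs, tol=0.3):
    """Prune pitch marks inconsistent with the known fundamental
    frequency: keep a mark only if its distance to the previously kept
    mark is within `tol` of the expected period (in samples)."""
    period = fs / f0
    kept = [marks[0]]
    for m in marks[1:]:
        if abs((m - kept[-1]) - period) <= tol * period:
            kept.append(m)
    return kept

fs, f0 = 16000, 100               # expected period: 160 samples
marks = [0, 160, 230, 320, 480]   # 230 is a spurious mark
assert prune_marks(marks, f0, fs) == [0, 160, 320, 480]
```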